Normalization of non-standard words

نویسندگان

  • Richard Sproat
  • Alan W. Black
  • Stanley F. Chen
  • Shankar Kumar
  • Mari Ostendorf
  • Christopher Richards
چکیده

In addition to ordinary words and names, real text contains non-standard “words” (NSWs), including numbers, abbreviations, dates, currency amounts and acronyms. Typically, one cannot find NSWs in a dictionary, nor can one find their pronunciation by an application of ordinary “letter-to-sound” rules. Non-standard words also have a greater propensity than ordinary words to be ambiguous with respect to their interpretation or pronunciation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. Typical technology for text normalization involves sets of ad hoc rules tuned to handle one or two genres of text (often newspaper-style text) with the expected result that the techniques do not usually generalize well to new domains. The purpose of the work reported here is to take some initial steps towards addressing deficiencies in previous approaches to text normalization. We developed a taxonomy of NSWs on the basis of four rather distinct text types—news text, a recipes newsgroup, a hardware-product-specific newsgroup, and real-estate classified ads. We then investigated the application of several general techniques including n-gram language models, decision trees and weighted finite-state transducers to the range of NSW types, and demonstrated that a systematic treatment can lead to better results than have been obtained by the ad hoc treatments that have typically been used in the past. For abbreviation expansion in particular, we investigated both supervised and unsupervised approaches. We report results in terms of word-error rate, which is standard in speech recognition evaluations, but which has only occasionally been used as an overall measure in evaluating text normalization systems. c © 2001 Academic Press Author for correspondence: AT&T Labs–Research, Shannon Laboratory, Room B207, 180 Park Avenue, PO Box 971, Florham Park, NJ 07932-0000, U.S.A. E-mail: [email protected] 0885–2308/01/030287 + 47 $35.00/0 c © 2001 Academic Press 288 R. Sproat et al.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Normalization of Non-Standard Words in Croatian Texts

This paper presents text normalization which is an integral part of any text-to-speech synthesis system. Text normalization is a set of methods with a task to write non-standard words, like numbers, dates, times, abbreviations, acronyms and the most common symbols, in their full expanded form are presented. The whole taxonomy for classification of non-standard words in Croatian language togethe...

متن کامل

Context Tailoring for Text Normalization

Language processing tools suffer from significant performance drops in social media domain due to its continuously evolving language. Transforming non-standard words into their standard forms has been studied as a step towards proper processing of ill-formed texts. This work describes a normalization system that considers contextual and lexical similarities between standard and non-standard wor...

متن کامل

Normalization of Text Messages Using Character- and Phone-based Machine Translation Approaches

There are many abbreviation and non-standard words in SMS and Twitter messages. They are problematic for text-to-speech (TTS) or language processing techniques for these data. A character-based machine translation (MT) approach was previously used for normalization of non-standard words. In this paper, we propose a two-stage translation method to leverage phonetic information, where non-standar...

متن کامل

A Graph-based Approach for Contextual Text Normalization

The informal nature of social media text renders it very difficult to be automatically processed by natural language processing tools. Text normalization, which corresponds to restoring the non-standard words to their canonical forms, provides a solution to this challenge. We introduce an unsupervised text normalization approach that utilizes not only lexical, but also contextual and grammatica...

متن کامل

Tweet Normalization with Syllables

In this paper, we propose a syllable-based method for tweet normalization to study the cognitive process of non-standard word creation in social media. Assuming that syllable plays a fundamental role in forming the non-standard tweet words, we choose syllable as the basic unit and extend the conventional noisy channel model by incorporating the syllables to represent the word-to-word transition...

متن کامل

Joint POS Tagging and Text Normalization for Informal Text

Text normalization and part-of-speech (POS) tagging for social media data have been investigated recently, however, prior work has treated them separately. In this paper, we propose a joint Viterbi decoding process to determine each token’s POS tag and non-standard token’s correct form at the same time. In order to evaluate our approach, we create two new data sets with POS tag labels and non-s...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Computer Speech & Language

دوره 15  شماره 

صفحات  -

تاریخ انتشار 2001